# High-precision image captioning

Amoral Gemma3 12B Vision

A vision-enhanced version of soob3123/amoral-gemma3-12B that combines the Gemma3-12B large language model with a visual encoder for multimodal tasks.

Transformers English

Pixtral is a multimodal model based on the Mistral architecture, capable of processing both image and text inputs to generate detailed textual descriptions.

mistral-community

© 2025 AIbase